mechanistic interpretability
Prompting as Scientific Inquiry
Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs--few-shot learning, chain-of-thought, constitutional AI--was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of organism--complex, opaque, and trained rather than programmed--then prompting is not a workaround.
Interpreting Emergent Features in Deep Learning-based Side-channel Analysis
Side-channel analysis (SCA) poses a real-world threat by exploiting unintentional physical signals to extract secret information from secure devices. Evaluation labs also use the same techniques to certify device security. In recent years, deep learning has emerged as a prominent method for SCA, achieving state-of-the-art attack performance at the cost of interpretability. Understanding how neural networks extract secrets is crucial for security evaluators aiming to defend against such attacks, as only by understanding the attack can one propose better countermeasures. In this work, we apply mechanistic interpretability to neural networks trained for SCA, revealing $\textit{how}$ models exploit $\textit{what}$ leakage in side-channel traces. We focus on sudden jumps in performance to reverse engineer learned representations, ultimately recovering secret masks and moving the evaluation process from black-box to white-box. Our results show that mechanistic interpretability can scale to realistic SCA settings, even when relevant inputs are sparse, model accuracies are low, and side-channel protections prevent standard input interventions.
This startup's new mechanistic interpretability tool lets you debug LLMs
This startup's new mechanistic interpretability tool lets you debug LLMs Goodfire wants to make training AI models more like good old-fashioned software engineering. The San Francisco-based startup Goodfire just released a new tool, called Silico, that lets researchers and engineers peer inside an AI model and adjust its parameters--the settings that determine a model's behavior --during training. This could give model makers more fine-grained control over how this technology is built than was once thought possible. Goodfire claims Silico is the first off-the-shelf tool of its kind that can help developers debug all stages of the development process, from building a data set to training a model. LLMs contain a LOT of parameters. The company says its mission is to make building AI models less like alchemy and more like a science.
There Will Be a Scientific Theory of Deep Learning
Simon, Jamie, Kunin, Daniel, Atanasov, Alexander, Boix-Adserà, Enric, Bordelon, Blake, Cohen, Jeremy, Ghosh, Nikhil, Guth, Florentin, Jacot, Arthur, Kamb, Mason, Karkada, Dhruva, Michaud, Eric J., Ottlik, Berkan, Turnbull, Joseph
In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.
Identifying interactions at scale for LLMs
Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process more transparent to model builders and impacted humans, a step toward safer and more trustworthy AI. To achieve state-of-the-art performance, models synthesize complex feature relationships, find shared patterns from diverse training examples, and process information through highly interconnected internal components. In this blog post, we describe the fundamental ideas behind SPEX and ProxySPEX, algorithms capable of identifying these critical interactions at scale. We mask or remove specific segments of the input prompt and measure the resulting shift in the predictions.
Compact Proofs of Model Performance via Mechanistic Interpretability
We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance.We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-$K$, validating proof transferability across 151 random seeds and four values of $K$.We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models.Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding.Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds.We confirm these connections by qualitatively examining a subset of our proofs.Finally, we identify compounding structureless errors as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
Meet the new biologists treating LLMs like aliens
By studying large language models as if they were living things instead of computer programs, scientists are discovering some of their secrets for the first time. How large is a large language model? Think about it this way. In the center of San Francisco there's a hill called Twin Peaks from which you can view nearly the entire city. Picture all of it--every block and intersection, every neighborhood and park, as far as you can see--covered in sheets of paper. Now picture that paper filled with numbers. LLMs contain a LOT of parameters. That's one way to visualize a large language model, or at least a medium-size one: Printed out in 14-point type, a 200-billion-parameter model, such as GPT4o (released by OpenAI in 2024), could fill 46 square miles of paper--roughly enough to cover San Francisco.
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models
In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We here ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We use a psychophysical paradigm to quantify one form of mechanistic interpretability for a diverse suite of nine models and find no scaling effect for interpretability - neither for model nor dataset size. Specifically, none of the investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago.